Abstract

The ability to represent complex high dimensional probability distributions in a compact form is one of the key insights in the field of graphical models. Factored representations are ubiquitous in machine learning and lead to major computational advantages. We explore a different type of compact representation based on discrete Fourier representations, complementing the classical approach based on conditional independencies. We show that a large class of probabilistic graphical models have a compact Fourier representation. This theoretical result opens up an entirely new way of approximating a probability distribution. We demonstrate the significance of this approach by applying it to the variable elimination algorithm. Compared with the traditional bucket representation and other approximate inference algorithms, we obtain significant improvements.

1. Introduction 

Probabilistic inference is a key computational challenge in statistical machine learning and artificial intelligence. Inference methods have a wide range of applications, from learning models to making predictions and informing decision-making using statistical models. Unfortunately, the inference problem is computationally intractable, and standard exact inference algorithms, such as variable elimination and junction tree algorithms, have worst-case exponential complexity.

The ability to represent complex high dimensional probability distributions in a compact form is perhaps the most important insight in the field of graphical models. The fundamental idea is to exploit (conditional) independencies between the variables to achieve compact factored representations, where a complex global model is represented as a product of simpler, local models. Similar ideas have been considered in the analysis of Boolean functions and logical forms (Dechter, 1997), as well as in physics with low rank tensor decompositions and matrix product states representations (Jordan et al., 1999; Linden et al., 2003; Sontag et al., 2008; Friesen & Domingos, 2015).

Figure 1. An example of a decision tree representing a function $f : \{x_1, \ldots, x_7\} \to \mathbb{R}^+$.

Compact representations are also key for the development of efficient inference algorithms, including message-passing ones. Efficient algorithms can be developed when messages representing the interaction among many variables can be decomposed or approximated with the product of several smaller messages, each involving a subset of the original variables. Numerous approximate and exact inference algorithms are based on this idea (Bahar et al., 1993; Flerova et al., 2011; Mateescu et al., 2010; Gogate & Domingos, 2013; Wainwright et al., 2003; Darwiche & Marquis, 2002; Ihler et al., 2012; Hazan & Jaakkola, 2012).

Conditional independence (and related factorizations) is not the only type of structure that can be exploited to achieve compactness. For example, consider the weighted decision tree in Figure 1. No two variables in the probability distribution in Figure 1 are independent of each other. The probability distribution cannot be represented by the product of simpler terms of disjoint domains and hence we cannot take advantage of independencies. The full probability table needs $2^7 = 128$ entries to be represented exactly. Nevertheless, this table can be described exactly by 8 simple decision rules, each corresponding to a path from the root to a leaf in the tree.

In this paper, we explore a novel way to exploit compact representations of high-dimensional probability tables in (approximate) probabilistic inference algorithms. Our approach is based on a (discrete) Fourier representation of the tables, which can be interpreted as a change of basis. Crucially, tables that are dense in the canonical basis can have a sparse Fourier representation. In particular, under certain conditions, probability tables can be represented (or well approximated) using a small number of Fourier coefficients. The Fourier representation has found numerous recent applications, including modeling stochastic processes (Rogers, 2000; Abbring & Salimans, 2012), manifolds (Cohen & Welling, 2015), and permutations (Huang et al., 2009). Our approach is based on the Fourier representation of Boolean functions, which has found tremendous success in PAC learning (O'Donnell, 2008; Mansour, 1994; Blum et al., 1998; Buchman et al., 2012), but these ideas have not been fully exploited in the fields of probabilistic inference and graphical models.

In general, a factor over $n$ Boolean variables requires $O(2^n)$ entries to be specified, and similarly the corresponding Fourier representation is dense in general, i.e., it has $O(2^n)$ non-zero coefficients. However, a rather surprising fact, first discovered by Linial (Linial et al., 1993), is that factors corresponding to fairly general classes of logical forms admit a compact Fourier representation. Linial discovered that formulas in Conjunctive Normal Form (CNF) and Disjunctive Normal Form (DNF) with bounded width (the number of variables in each clause) have compact Fourier representations.

In this paper, we introduce a novel approach for using approximate Fourier representations in the field of probabilistic inference. We generalize the work of Linial to the case of probability distributions (the weighted case where the entries are not necessarily 0 or 1), showing that a large class of probabilistic graphical models have a compact Fourier representation. The proof extends Håstad's Switching Lemma (Hastad, 1987) to the weighted case. At a high level, a compact Fourier representation often means that the weighted probability distribution can be captured by a small set of critical decision rules. Hence, this notion is closely related to decision trees with bounded depth.


Sparse (low-degree) Fourier representations provide an entirely new way of approximating a probability distribution. We demonstrate the power of this idea by applying it to the variable elimination algorithm. Despite its conceptual simplicity, we show in Table 2 that the variable elimination algorithm with Fourier representation outperforms Minibucket, Belief Propagation and MCMC, and is competitive with, and on several categories of the UAI Inference Challenge even outperforms, the award-winning solver HAK.

2. Preliminaries 

2.1. Inference in Graphical Models 

We consider a Boolean graphical model over $N$ Boolean variables $\{x_1, x_2, \ldots, x_N\}$. We use bold typed variables to represent a vector of variables. For example, the vector of all Boolean variables $\mathbf{x}$ is written as $\mathbf{x} = (x_1, x_2, \ldots, x_N)^T$. We also use $\mathbf{x}_S$ to represent the image of vector $\mathbf{x}$ projected onto a subset of variables: $\mathbf{x}_S = (x_{i_1}, x_{i_2}, \ldots, x_{i_k})^T$ where $S = \{i_1, \ldots, i_k\}$. A probabilistic graphical model is defined as:

$$\Pr(\mathbf{x}) = \frac{1}{Z} f(\mathbf{x}) = \frac{1}{Z} \prod_{i=1}^{K} \psi_i(\mathbf{x}_{S_i}),$$

where each $\psi_i : \{-1, 1\}^{|S_i|} \to \mathbb{R}^+$ is called a factor, and is a function that depends on a subset of variables whose indices are in $S_i$. $Z = \sum_{\mathbf{x}} \prod_{i=1}^{K} \psi_i(\mathbf{x}_{S_i})$ is the normalization factor, and is often called the partition function. In this paper, we will use $-1$ and $1$ to represent false and true. We consider two key probabilistic inference tasks: the computation of the partition function $Z$ (PR) and marginal probabilities $\Pr(e) = \frac{1}{Z} \sum_{\mathbf{x} \sim e} f(\mathbf{x})$ (Marginal), in which $\mathbf{x} \sim e$ means that $\mathbf{x}$ is consistent with the evidence $e$.

The Variable Elimination Algorithm is an exact algorithm to compute marginals and the partition function for general graphical models. It starts with a variable ordering $\pi$. In each iteration, it eliminates one variable by multiplying all factors involving that variable, and then summing that variable out. When all variables are eliminated, the factor remaining is a singleton, whose value corresponds to the partition function. The complexity of the VE algorithm depends on the size of the largest factors generated during the elimination process, and is known to be exponential in the tree-width (Gogate & Dechter, 2004).
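
The elimination loop just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the factor encoding (a sorted tuple of variable indices plus a dict from ±1 assignments to values) and the helper names `multiply`, `sum_out` and `partition_function` are hypothetical, and every variable in the ordering is assumed to appear in at least one factor.

```python
from itertools import product

def multiply(f, g):
    """Pointwise product of two factors, each given as (vars, table)."""
    fv, ft = f
    gv, gt = g
    vs = tuple(sorted(set(fv) | set(gv)))
    table = {}
    for assign in product((-1, 1), repeat=len(vs)):
        val = dict(zip(vs, assign))
        table[assign] = (ft[tuple(val[v] for v in fv)]
                         * gt[tuple(val[v] for v in gv)])
    return vs, table

def sum_out(f, var):
    """Sum one variable out of a factor."""
    fv, ft = f
    idx = fv.index(var)
    vs = fv[:idx] + fv[idx + 1:]
    table = {}
    for assign, value in ft.items():
        key = assign[:idx] + assign[idx + 1:]
        table[key] = table.get(key, 0.0) + value
    return vs, table

def partition_function(factors, order):
    """Eliminate variables in the given order and return Z.

    Assumes each variable in `order` occurs in at least one factor.
    """
    factors = list(factors)
    for var in order:
        bucket = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        if not bucket:
            continue
        prod = bucket[0]
        for f in bucket[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))
    z = 1.0
    for _, table in factors:  # only constant factors remain
        z *= table[()]
    return z
```

The size of the intermediate `prod` table is what drives the exponential cost in the tree-width: it has one entry per joint assignment of all variables that share a bucket.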

Dechter proposed the Mini-bucket Elimination Algorithm (Dechter, 1997), which dynamically decomposes and approximates factors (when the domain of a product exceeds a threshold) with the product of smaller factors during the elimination process. Mini-bucket can provide upper and lower bounds on the partition function. The authors of (van Rooij et al., 2009; Smith & Gogate, 2013) develop fast operations similar to the Fast Fourier transformation, and use them to speed up exact inference. Their approaches do not approximate the probability distribution, which is a key difference from this paper.

2.2. Hadamard-Fourier Transformation 

The Hadamard-Fourier transformation has attracted a lot of attention in PAC Learning Theory. Table 1 provides an example where a function $\phi(x, y)$ is transformed into its Fourier representation. The transformation works by writing $\phi(x, y)$ using interpolation, then re-arranging the terms to get a canonical term. The example can be generalized, and it can be shown that any function defined on a Boolean hypercube has an equivalent Fourier representation.

Theorem 1. (Hadamard-Fourier Transformation) Every $f : \{-1, 1\}^n \to \mathbb{R}$ can be uniquely expressed as a multilinear polynomial,

$$f(\mathbf{x}) = \sum_{S \subseteq [n]} c_S \prod_{i \in S} x_i,$$

where each $c_S \in \mathbb{R}$. This polynomial is referred to as the Hadamard-Fourier expansion of $f$.

Here, $[n]$ denotes the set $\{1, \ldots, n\}$, so the sum ranges over all of its subsets. Following standard notation, we will write $\hat{f}(S)$ to denote the coefficient $c_S$ and $\chi_S(\mathbf{x})$ for the basis function $\prod_{i \in S} x_i$. As a special case, $\chi_\emptyset = 1$. Notice these basis functions are parity functions. We also call $\hat{f}(S)$ a degree-$k$ coefficient of $f$ iff $|S| = k$. In our example in Table 1, the coefficient for basis function $xy$ is $\hat{\phi}(\{x, y\}) = \frac{1}{4}(\phi_1 - \phi_2 - \phi_3 + \phi_4)$, which is a degree-2 coefficient.
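
As a small check of this notation, the coefficients can be computed by brute force from the identity $\hat{f}(S) = 2^{-n} \sum_{\mathbf{x}} f(\mathbf{x}) \chi_S(\mathbf{x})$. The sketch below is illustrative only: the function name `fourier_coefficients`, the numeric values, and the table layout ($\phi_1, \ldots, \phi_4$ taken as the values at $(x, y) = (-1,-1), (-1,1), (1,-1), (1,1)$, which is inferred from the expansion in Table 1) are assumptions.

```python
from itertools import product

def fourier_coefficients(f, n):
    """Brute-force Hadamard-Fourier coefficients of f: {-1,1}^n -> R.

    Returns a dict mapping each subset S (a frozenset of variable indices)
    to (1 / 2^n) * sum_x f(x) * prod_{i in S} x_i.
    """
    points = list(product((-1, 1), repeat=n))
    coeffs = {}
    for mask in product((0, 1), repeat=n):
        S = frozenset(i for i in range(n) if mask[i])
        total = 0.0
        for x in points:
            chi = 1
            for i in S:
                chi *= x[i]
            total += f(x) * chi
        coeffs[S] = total / len(points)
    return coeffs

# Arbitrary example values for phi_1..phi_4 (assumed layout, see lead-in).
phi = {(-1, -1): 1.0, (-1, 1): 2.0, (1, -1): 3.0, (1, 1): 8.0}
c = fourier_coefficients(lambda x: phi[x], 2)
# c[frozenset({0, 1})] is the degree-2 coefficient (phi1 - phi2 - phi3 + phi4) / 4.
```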

We re-iterate some classical results on Fourier expansion. First, as with the classical (inverse) Fast Fourier Transformation (FFT) in the continuous domain, there are similar divide-and-conquer algorithms (FFT and invFFT) which connect the table representation of $f$ (e.g., upper left table, Table 1) with its Fourier representation (e.g., bottom representation, Table 1). Both FFT and invFFT run in time $O(n \cdot 2^n)$ for a function involving $n$ variables. In fact, the length-$2^n$ vector of all function values and the length-$2^n$ vector of Fourier coefficients are connected by a $2^n$-by-$2^n$ matrix $H_n$, which is often called the $n$-th Hadamard-Fourier matrix. In addition, we have Parseval's identity for Boolean functions as well: $E_{\mathbf{x}}[f(\mathbf{x})^2] = \sum_{S \subseteq [n]} \hat{f}(S)^2$.
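
The divide-and-conquer transform mentioned above is the fast Walsh-Hadamard transform. A minimal sketch follows; the names `fwht` and `inverse_fwht` and the bit encoding of assignments (bit $i$ of the table index is 1 iff $x_i = -1$) are assumptions made for the example, not conventions taken from the paper.

```python
def fwht(values):
    """Fast Walsh-Hadamard transform in O(n * 2^n) time.

    values[k] holds f(x), where bit i of k is 1 iff x_i = -1 (assumed encoding).
    Returns coeffs with coeffs[s] = f_hat(S), where bit i of s is 1 iff i is in S.
    """
    a = list(values)
    h = 1
    while h < len(a):
        for start in range(0, len(a), 2 * h):
            for j in range(start, start + h):
                u, v = a[j], a[j + h]
                a[j], a[j + h] = u + v, u - v  # butterfly step
        h *= 2
    return [c / len(a) for c in a]

def inverse_fwht(coeffs):
    """Inverse transform: the same butterfly without the 1 / 2^n scaling."""
    return [c * len(coeffs) for c in fwht(coeffs)]

values = [0.5, 1.0, 0.25, 2.0, 1.5, 0.0, 3.0, 1.0]  # a function of 3 variables
coeffs = fwht(values)
# Parseval: the mean of f(x)^2 equals the sum of squared coefficients.
lhs = sum(v * v for v in values) / len(values)
rhs = sum(c * c for c in coeffs)
```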

3. Low Degree Concentration of Fourier 
Coefficients 
$$\phi(x, y) = \frac{1}{4}(\phi_1 + \phi_2 + \phi_3 + \phi_4) + \frac{1}{4}(-\phi_1 - \phi_2 + \phi_3 + \phi_4)\,x + \frac{1}{4}(-\phi_1 + \phi_2 - \phi_3 + \phi_4)\,y + \frac{1}{4}(\phi_1 - \phi_2 - \phi_3 + \phi_4)\,xy.$$

Table 1. (Upper Left) Function $\phi : \{-1, 1\}^2 \to \mathbb{R}$ is represented in a table. (Upper Right) $\phi$ is re-written using interpolation. (Bottom) The terms of the upper-right equation are re-arranged, which yields the Fourier expansion of function $\phi$.


Fourier expansion replaces the table representation of a weighted function with its Fourier coefficients. For a function with $n$ Boolean variables, the complete table representation requires $2^n$ entries, and so does the full Fourier expansion. Interestingly, many natural functions can be approximated well with only a few Fourier coefficients. This raises a natural question: what type of functions can be well approximated with a compact Fourier expansion?

We first discuss which functions can be represented exactly in the Fourier domain with coefficients up to degree $d$. To answer this question, we show a tight connection between Fourier representations with bounded degree and decision trees with bounded depth. A decision tree for a weighted function $f : \{-1, 1\}^n \to \mathbb{R}$ is a tree in which each inner node is labelled with one variable, and has two out-going edges, one labelled with $-1$, and the other one with $1$. The leaf nodes are labelled with real values. When evaluating the value on an input $\mathbf{x} = x_1 x_2 \ldots x_n$, we start from the root node, and follow the corresponding out-going edges by inspecting the value of one variable at each step, until we reach one of the leaf nodes. The value at the leaf node is the output for $f(\mathbf{x})$. The depth of the decision tree is defined as the length of the longest path from the root node to one of the leaf nodes. Figure 1 provides a decision tree representation for a weighted Boolean function. One classical result (O'Donnell, 2008) states that if a function can be captured by a decision tree with depth $d$, then it can be represented with Fourier coefficients up to degree $d$:

Theorem 2. Suppose $f : \{-1, 1\}^n \to \mathbb{R}$ can be represented by a decision tree of depth $d$, then all the coefficients whose degree is larger than $d$ are zero in $f$'s Fourier expansion: $\hat{f}(S) = 0$ for all $S$ such that $|S| > d$.

We can also provide the converse of Theorem 2: 

Theorem 3. Suppose $f : \{-1, 1\}^n \to \mathbb{R}$ can be represented by a Fourier expansion with non-zero coefficients up to degree $d$, then $f$ can be represented by the sum of several decision trees, each of which has depth at most $d$.
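
Theorem 2 can be checked numerically on a toy example. The sketch below builds a depth-2 decision tree over three variables, with a different variable at each inner node as in Figure 1, and evaluates two coefficients by brute force. The tuple encoding of the tree, the leaf values, and the helper names `tree_eval` and `coefficient` are assumptions chosen for illustration.

```python
from itertools import product

def tree_eval(tree, x):
    """Evaluate a decision tree at x.

    An inner node is (var, subtree_if_minus_one, subtree_if_plus_one);
    a leaf is a plain number.
    """
    while isinstance(tree, tuple):
        var, lo, hi = tree
        tree = lo if x[var] == -1 else hi
    return tree

def coefficient(f, n, S):
    """Brute-force Fourier coefficient f_hat(S) of f: {-1,1}^n -> R."""
    total = 0.0
    for x in product((-1, 1), repeat=n):
        chi = 1
        for i in S:
            chi *= x[i]
        total += f(x) * chi
    return total / 2 ** n

# Depth-2 tree: the root tests x0, its children test x1 and x2 respectively.
tree = (0, (1, 0.1, 0.7), (2, 0.4, 0.9))
f = lambda x: tree_eval(tree, x)

top = coefficient(f, 3, (0, 1, 2))  # degree 3 exceeds the depth 2, so it vanishes
pair = coefficient(f, 3, (0, 1))    # a degree-2 coefficient, non-zero here
```

The non-zero degree-2 coefficient reflects the dependence between `x0` and `x1`, while everything above the tree depth is zero (up to floating-point rounding).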

















Theorem 2 and Theorem 3 provide a tight connection between the Fourier expansion and decision trees. This is also part of the reason why the Fourier representation is a powerful tool in PAC learning. Notice that the Fourier representation complements the classical way of approximating weighted functions by exploiting independencies. To see this, suppose there is a decision tree of the same structure as in Figure 1, but with depth $d$. According to Theorem 2, it can be represented exactly with Fourier coefficients up to degree $d$. In this specific example, the number of non-zero Fourier coefficients is $O(2^{2d})$. Nonetheless, no two variables in Figure 1 are independent of each other. Therefore, it is not possible to decompose this factor into a product of smaller factors with disjoint domains (exploiting independencies). Notice that the full table representation of this factor has $O(2^{2^d})$ entries, because different nodes in the decision tree have different variables and there are $O(2^d)$ variables in total in this example.

If we are willing to accept an approximate representation, low degree Fourier coefficients can capture an even wider class of functions. We follow the standard notion of $\epsilon$-concentration:

Definition 1. The Fourier spectrum of $f : \{-1, 1\}^n \to \mathbb{R}$ is $\epsilon$-concentrated on degree up to $k$ if and only if $W_{>k}[f] = \sum_{S \subseteq [n], |S| > k} \hat{f}(S)^2 \le \epsilon$.

We say a CNF (DNF) formula has bounded width $w$ if and only if every clause (term) of the CNF (DNF) has at most $w$ literals. In the literature outside of PAC Learning, this is also referred to as a CNF (DNF) with clause (term) length $w$. Linial (Linial et al., 1993) proved the following result:

Theorem 4 (Linial). Suppose $f : \{-1, 1\}^n \to \{-1, 1\}$ is computable by a DNF (or CNF) of width $w$, then $f$'s Fourier spectrum is $\epsilon$-concentrated on degree up to $O(w \log(1/\epsilon))$.

Linial's result demonstrates the power of Fourier representations, since bounded width CNFs (or DNFs) include a very rich class of functions. Interestingly, the bound does not depend on the number of clauses, even though the clause-variable ratio is believed to characterize the hardness of satisfiability problems.

As a contribution of this paper, we extend Linial's results to a class of weighted probabilistic graphical models, which are contractive with gap $1 - \eta$ and have bounded width $w$. To our knowledge, this extension from the deterministic case to the probabilistic case is novel.

Definition 2. Suppose $f(\mathbf{x}) : \{-1, 1\}^n \to \mathbb{R}^+$ is a weighted function. We say $f(\mathbf{x})$ has bounded width $w$ iff the number of variables in the domain of $f$ is no more than $w$. We say $f(\mathbf{x})$ is contractive with gap $1 - \eta$ ($0 \le \eta < 1$) if and only if (1) for all $\mathbf{x}$, $f(\mathbf{x}) \le 1$; (2) $\max_{\mathbf{x}} f(\mathbf{x}) = 1$; (3) if $f(\mathbf{x}_0) < 1$, then $f(\mathbf{x}_0) \le \eta$.


The first and second conditions are mild restrictions. For a graphical model, we can always rescale each factor properly to ensure its range is within $[0, 1]$ and the largest element is 1. The approximation bound we are going to prove depends on the gap $1 - \eta$. Ideally, we want $\eta$ to be small. The class of contractive functions with gap $1 - \eta$ still captures a wide class of interesting graphical models. For example, it captures Markov Logic Networks (Richardson & Domingos, 2006), when the weight of each clause is large. Note that this is one possible condition under which we were able to prove the weight concentration result. In practice, because a compact Fourier representation is more about the structure of the weighted distribution (captured by a series of decision trees of given depth), graphical models with large $\eta$ could also have concentrated weights. The main theorem we are going to prove is as follows:

Theorem 5. (Main) Suppose $f(\mathbf{x}) = \prod_{i=1}^{m} f_i(\mathbf{x}_{S_i})$, in which every $f_i$ is a contractive function with width $w$ and gap $1 - \eta$, then $f$'s Fourier spectrum is $\epsilon$-concentrated on degree up to $O(w \log(1/\epsilon) \log_\eta \epsilon)$ when $\eta > 0$ and $O(w \log(1/\epsilon))$ when $\eta = 0$.

The proof of Theorem 5 relies on the notion of random restriction and our own extension to Håstad's Switching Lemma (Hastad, 1987).

Definition 3. Let $f(\mathbf{x}) : \{-1, 1\}^n \to \mathbb{R}$ and let $J$ be a subset of all the variables $x_1, \ldots, x_n$. Let $\mathbf{z}$ be an assignment to the remaining variables $\bar{J} = \{x_1, \ldots, x_n\} \setminus J$. Define $f|_{J|\mathbf{z}} : \{-1, 1\}^{J} \to \mathbb{R}$ to be the restricted function of $f$ on $J$ obtained by setting all the remaining variables in $\bar{J}$ according to $\mathbf{z}$.

Definition 4. ($\delta$-random restriction) A $\delta$-random restriction of $f(\mathbf{x}) : \{-1, 1\}^n \to \mathbb{R}$ is defined as $f|_{J|\mathbf{z}}$, when elements in $J$ are selected randomly with probability $\delta$, and $\mathbf{z}$ is formed by randomly setting variables in $\bar{J}$ to either $-1$ or $1$. We also say $J|\mathbf{z}$ is a $\delta$-random restriction set.
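
Definitions 3 and 4 translate directly into a sampling routine. The following sketch is only an illustration: the function names `random_restriction` and `restrict`, the 0-based variable indexing, and the use of Python's `random.Random` as the source of randomness are assumptions.

```python
import random

def random_restriction(n, delta, rng):
    """Sample a delta-random restriction set J|z over variables 0..n-1.

    Each variable stays free (is placed in J) with probability delta;
    every remaining variable is fixed to -1 or 1 uniformly at random.
    """
    J = [i for i in range(n) if rng.random() < delta]
    z = {i: rng.choice((-1, 1)) for i in range(n) if i not in J}
    return J, z

def restrict(f, n, J, z):
    """Return the restricted function f|_{J|z}, taking values for J in order."""
    def restricted(y):
        x = [0] * n
        for i, v in z.items():
            x[i] = v
        for i, v in zip(J, y):
            x[i] = v
        return f(tuple(x))
    return restricted

rng = random.Random(0)
J, z = random_restriction(6, 0.5, rng)
g = restrict(lambda x: sum(x), 6, J, z)  # g depends only on the variables in J
```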

With these definitions, we prove our weighted extension to Håstad's Switching Lemma:

Lemma 1. (Weighted Håstad's Switching Lemma) Suppose $f(\mathbf{x}) = \prod_{i=1}^{m} f_i(\mathbf{x}_{S_i})$, in which every $f_i$ is a contractive function with width $w$ and gap $1 - \eta$. Suppose $J|\mathbf{z}$ is a $\delta$-random restriction set, then

$$\Pr\left(\exists \text{ decision tree } h \text{ with depth } t,\ \|h - f|_{J|\mathbf{z}}\|_\infty \le \gamma\right)$$

in which $u = \lceil \log_\eta \gamma \rceil + 1$ if $0 < \eta < 1$ or $u = 1$ if $\eta = 0$, and $\|\cdot\|_\infty$ means $\max |\cdot|$.

The formal proof of Lemma 1 is based on a generalization of the proof by Razborov for the unweighted case (Razborov, 1995), and is deferred to the supplementary materials.
Lemma 2. Suppose $f(\mathbf{x}) : \{-1, 1\}^n \to \mathbb{R}$ and $|f(\mathbf{x})| \le 1$. $J|\mathbf{z}$ is a $\delta$-random restriction set. $t \in \mathbb{N}$, $\gamma > 0$, and let $\epsilon_0 = \Pr\{\neg\exists$ decision tree $h$ with depth $t$ such that $\|f|_{J|\mathbf{z}} - h\|_\infty \le \gamma\}$, then the Fourier spectrum of $f$ is $4(\epsilon_0 + (1 - \epsilon_0)\gamma^2)$-concentrated on degree up to $2t/\delta$.

Proof. We first bound $E_{J|\mathbf{z}}\left[\sum_{S \subseteq [n], |S| > t} \widehat{f|_{J|\mathbf{z}}}(S)^2\right]$. With probability $1 - \epsilon_0$, there is a decision tree $h$ with depth $t$ such that $\|f|_{J|\mathbf{z}}(\mathbf{x}) - h(\mathbf{x})\|_\infty \le \gamma$. In this scenario,

$$\sum_{S \subseteq [n], |S| > t} \widehat{f|_{J|\mathbf{z}}}(S)^2 = \sum_{S \subseteq [n], |S| > t} \left(\widehat{f|_{J|\mathbf{z}}}(S) - \hat{h}(S)\right)^2. \quad (1)$$

This is because, due to Theorem 2, $\hat{h}(S) = 0$ for all $S$ such that $|S| > t$. Because $|f|_{J|\mathbf{z}}(\mathbf{x}) - h(\mathbf{x})| \le \gamma$ for all $\mathbf{x}$, the right side of Equation 1 must satisfy

$$\sum_{S \subseteq [n], |S| > t} \left(\widehat{f|_{J|\mathbf{z}}}(S) - \hat{h}(S)\right)^2 \le \sum_{S \subseteq [n]} \left(\widehat{f|_{J|\mathbf{z}}}(S) - \hat{h}(S)\right)^2 = E\left[\left(f|_{J|\mathbf{z}}(\mathbf{x}) - h(\mathbf{x})\right)^2\right] \le \gamma^2. \quad (2)$$

The equality in Equation 2 is due to Parseval's Identity. With probability $\epsilon_0$, there are no decision trees close to $f|_{J|\mathbf{z}}$. However, because $|f|_{J|\mathbf{z}}| \le 1$, we must have $\sum_{S \subseteq [n], |S| > t} \widehat{f|_{J|\mathbf{z}}}(S)^2 \le 1$. Summarizing these two points, we have:

$$E_{J|\mathbf{z}}\left[\sum_{S \subseteq [n], |S| > t} \widehat{f|_{J|\mathbf{z}}}(S)^2\right] \le (1 - \epsilon_0)\gamma^2 + \epsilon_0.$$

Using a known result $E_{J|\mathbf{z}}\left[\widehat{f|_{J|\mathbf{z}}}(S)^2\right] = \sum_{U \subseteq [n]} \Pr\{U \cap J = S\} \cdot \hat{f}(U)^2$, we have:

$$E_{J|\mathbf{z}}\left[\sum_{S \subseteq [n], |S| > t} \widehat{f|_{J|\mathbf{z}}}(S)^2\right] = \sum_{S \subseteq [n], |S| > t} E_{J|\mathbf{z}}\left[\widehat{f|_{J|\mathbf{z}}}(S)^2\right] = \sum_{U \subseteq [n]} \Pr\{|U \cap J| > t\} \cdot \hat{f}(U)^2. \quad (3)$$

The distribution of the random variable $|U \cap J|$ is Binomial$(|U|, \delta)$. When $|U| > 2t/\delta$, this variable has mean at least $2t$, so using the Chernoff bound, $\Pr\{|U \cap J| \le t\} \le (2/e)^t < 3/4$. Therefore,

$$(1 - \epsilon_0)\gamma^2 + \epsilon_0 \ge \sum_{U \subseteq [n]} \Pr\{|U \cap J| > t\} \cdot \hat{f}(U)^2 \ge \sum_{U \subseteq [n], |U| > 2t/\delta} \Pr\{|U \cap J| > t\} \cdot \hat{f}(U)^2 \ge \sum_{U \subseteq [n], |U| > 2t/\delta} \left(1 - \frac{3}{4}\right) \cdot \hat{f}(U)^2.$$

We get our claim $\sum_{|U| > 2t/\delta} \hat{f}(U)^2 \le 4((1 - \epsilon_0)\gamma^2 + \epsilon_0)$. □

Now we are ready to prove Theorem 5. First suppose $\eta > 0$. Choose $\gamma = \epsilon/8$, which ensures $4(1 - \epsilon_0)\gamma^2 \le \frac{1}{2}\epsilon$. Next choose $\delta = 1/(16uw + 1)$ and $t = C \log(1/\epsilon)$, with $C$ large enough that the bound on $\epsilon_0$ given by Lemma 1 yields $4\epsilon_0 \le \frac{1}{2}\epsilon$. Now we have $4((1 - \epsilon_0)\gamma^2 + \epsilon_0) \le \epsilon$. At the same time, $2t/\delta = 2C \log(1/\epsilon)(16uw + 1) = O(w \log(1/\epsilon) \log_\eta \epsilon)$.*

* $\eta = 0$ corresponds to the classical CNF (or DNF) case.

4. Variable Elimination in the Fourier Domain

We have seen above that a Fourier representation can provide a useful compact representation of certain complex probability distributions. In particular, this is the case for distributions that can be captured with a relatively sparse set of Fourier coefficients. We will now show the practical impact of this new representation by using it in an inference setting. In this section, we propose an inference algorithm which works like the classic Variable Elimination (VE) Algorithm, except that it passes messages represented in the Fourier domain.

The classical VE algorithm consists of two basic steps: the multiplication step and the elimination step. The multiplication step takes $f$ and $g$, and returns $f \cdot g$, while the elimination step sums out one variable $x_i$ from $f$ by returning $\sum_{x_i} f$. Hence, the success of the VE procedure in the Fourier domain depends on efficient algorithms to carry out these two steps. A naive approach is to transform the representation back to the value domain, carry out the two steps there, then transform it back to Fourier space. While correct, this strategy would eliminate all the benefits of Fourier representations.

Luckily, the elimination step can be carried out in the Fourier domain as follows:

Theorem 6. Suppose $f$ has a Fourier expansion: $f(\mathbf{x}) = \sum_{S \subseteq [n]} \hat{f}(S) \chi_S(\mathbf{x})$. Then the Fourier expansion for $f' = \sum_{x_i} f$ when $x_i$ is summed out is $\sum_{S \subseteq [n]} \hat{f'}(S) \chi_S(\mathbf{x})$, where $\hat{f'}(S) = 2\hat{f}(S)$ if $i \notin S$ and $\hat{f'}(S) = 0$ if $i \in S$.

The proof is left to the supplementary materials. From Theorem 6, one only needs a linear scan of all the Fourier coefficients of $f$ in order to compute the Fourier expansion for $f'$. Suppose $f$ has $m$ non-zero coefficients in its Fourier representation, this linear scan takes time $O(m)$.
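
The elimination rule of Theorem 6 amounts to a single pass over a sparse coefficient map. In the sketch below, the dict-of-frozensets representation and the names `evaluate` and `eliminate` are assumptions made for illustration; the paper does not specify a data structure.

```python
def evaluate(coeffs, x):
    """Evaluate sum_S coeffs[S] * prod_{i in S} x[i] at a +/-1 assignment x."""
    total = 0.0
    for S, c in coeffs.items():
        term = c
        for i in S:
            term *= x[i]
        total += term
    return total

def eliminate(coeffs, i):
    """Sum variable i out of a sparse Fourier expansion.

    Following Theorem 6: coefficients whose index set contains i vanish,
    and all remaining coefficients are doubled.
    """
    return {S: 2.0 * c for S, c in coeffs.items() if i not in S}

# f(x0, x1) = 3.5 + 2*x0 + 1.5*x1 + x0*x1
f = {frozenset(): 3.5, frozenset({0}): 2.0,
     frozenset({1}): 1.5, frozenset({0, 1}): 1.0}
g = eliminate(f, 0)  # expansion of f(-1, x1) + f(1, x1), i.e. 7 + 3*x1
```

The pass touches each stored coefficient once, which matches the $O(m)$ cost stated above.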
Figure 2. Weight concentration on low degree coefficients in the Fourier domain. Weighted random 3-SAT instances, with 20 variables and nc clauses. (Left) $\eta = 0.1$, (Right) $\eta = 0.6$.

There are several ways to implement the multiplication step. The first option is to use school book multiplication. To multiply functions $f$ and $g$, one multiplies every pair of their Fourier coefficients, and then combines similar terms. If $f$ and $g$ have $m_f$ and $m_g$ terms in their Fourier representations respectively, this operation takes time $O(m_f m_g)$. As a second option for multiplication, one can convert $f$ and $g$ to their value domain, multiply corresponding entries, and then convert the result back to the Fourier domain. Suppose the union of the domains of $f$ and $g$ has $n$ variables ($2^n$ Fourier terms), the conversion between the two domains dominates the complexity, which is $O(n \cdot 2^n)$. Nonetheless, when $f$ and $g$ are relatively dense, this method could have a better time complexity than school book multiplication. In our implementation, we compare the cost of the two options and always use the one with lower time complexity.
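
The school book option relies on the fact that $\chi_S \cdot \chi_T = \chi_{S \triangle T}$, since $x_i^2 = 1$ on $\{-1, 1\}$. A minimal sketch, assuming the same hypothetical dict-of-frozensets representation as a stand-in for whatever structure an implementation would use:

```python
def multiply(f, g):
    """School book product of two sparse Fourier expansions.

    Each pair of terms contributes to the symmetric difference of their
    index sets, because x_i * x_i = 1 for x_i in {-1, 1}.
    Runs in O(m_f * m_g) time for m_f and m_g stored coefficients.
    """
    out = {}
    for S, a in f.items():
        for T, b in g.items():
            U = S ^ T  # symmetric difference of two frozensets
            out[U] = out.get(U, 0.0) + a * b
    # "Combine similar terms" and drop those that cancel exactly.
    return {U: c for U, c in out.items() if c != 0.0}

# (1 + 2*x0) * (x0 - 3*x1) = 2 + x0 - 3*x1 - 6*x0*x1
f = {frozenset(): 1.0, frozenset({0}): 2.0}
g = {frozenset({0}): 1.0, frozenset({1}): -3.0}
h = multiply(f, g)
```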

Because we are working on models in which exact inference is intractable, sometimes we need to truncate the Fourier representation to prevent an exponential explosion. We implement two variants for truncation. One is to keep low degree Fourier coefficients, which is inspired by our theoretical observations. The other one is to keep Fourier coefficients with large absolute values, which offers a little extra flexibility, especially when the whole graphical model is dominated by a few key variables and we would like to exceed the degree limit occasionally. We found both variants work equally well.
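
The two truncation variants can be sketched as follows. The function names, the `budget` parameter (the maximum number of coefficients kept) and the tie-breaking rule in the degree-based variant are assumptions for illustration; the paper does not state how ties are resolved.

```python
def truncate_by_degree(coeffs, budget):
    """Keep at most `budget` coefficients, preferring the lowest degree.

    Ties within a degree are broken by larger absolute value (an assumption).
    """
    ranked = sorted(coeffs.items(), key=lambda kv: (len(kv[0]), -abs(kv[1])))
    return dict(ranked[:budget])

def truncate_by_magnitude(coeffs, budget):
    """Keep the `budget` coefficients with the largest absolute values."""
    ranked = sorted(coeffs.items(), key=lambda kv: -abs(kv[1]))
    return dict(ranked[:budget])

coeffs = {frozenset(): 0.5, frozenset({0}): 0.01, frozenset({1}): -0.3,
          frozenset({0, 1}): 0.9, frozenset({0, 1, 2}): -2.0}
low = truncate_by_degree(coeffs, 3)      # keeps degrees 0 and 1
big = truncate_by_magnitude(coeffs, 3)   # keeps the high-degree term -2.0
```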

5. Experiments 

5.1. Weight Concentration on Low Degree Coefficients 

(a) Mixed. Field 0.01 (b) Mixed. Field 0.1

Figure 3. Log-partition function absolute errors for 15x15 small scale Ising Grids. Fourier is for the VE Algorithm in the Fourier domain, mbe is for Mini-bucket Elimination. BP is for Belief Propagation. Large scale experiments are on the next page.

We first validate our theoretical results on the weight concentration on low-degree coefficients in Fourier representations. We evaluate our results on random weighted 3-SAT instances with 20 variables. Small instances are chosen because we have to compute the full Fourier spectrum. A weighted 3-SAT instance is specified by a CNF and a weight $\eta$. Each factor corresponds to a clause in the CNF. When the clause is satisfied, the corresponding factor evaluates to 1, otherwise it evaluates to $\eta$. For each $\eta$ and the number of clauses nc, we randomly generate 100 instances. For each instance, we compute the squared sum weight at each degree: $W_k[f] = \sum_{S \subseteq [n], |S| = k} \hat{f}(S)^2$. Figure 2 shows the median value of the squared sum weight over 100 instances for given $\eta$ and nc in log scale. As seen from the figure, although the full representation involves coefficients up to degree 20 (20 variables), the weights are concentrated on low degree coefficients (up to 5), regardless of $\eta$, which is in line with the theoretical result.
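
The per-degree weight $W_k[f]$ used in this experiment can be reproduced on a much smaller scale with the sketch below. It is not the experimental code: the instance size (8 variables, 10 clauses), the random seed, the clause encoding and all function names are assumptions chosen to keep the example short.

```python
import random

def weighted_3sat_table(n, clauses, eta):
    """Value table of f(x) = product over clauses of (1 if satisfied else eta).

    A clause is a list of (variable, sign) literals; a literal is true when
    x[variable] == sign. Index k encodes x with bit i set iff x_i = -1.
    """
    table = []
    for k in range(2 ** n):
        x = [-1 if (k >> i) & 1 else 1 for i in range(n)]
        value = 1.0
        for clause in clauses:
            if not any(x[v] == s for v, s in clause):
                value *= eta
        table.append(value)
    return table

def fwht(values):
    """Fast Walsh-Hadamard transform; returns the Fourier coefficients."""
    a = list(values)
    h = 1
    while h < len(a):
        for start in range(0, len(a), 2 * h):
            for j in range(start, start + h):
                u, v = a[j], a[j + h]
                a[j], a[j + h] = u + v, u - v
        h *= 2
    return [c / len(a) for c in a]

def weight_by_degree(coeffs, n):
    """W_k[f] for k = 0..n: squared coefficients grouped by |S|."""
    w = [0.0] * (n + 1)
    for s, c in enumerate(coeffs):
        w[bin(s).count("1")] += c * c
    return w

rng = random.Random(0)
n, nc, eta = 8, 10, 0.1
clauses = [[(v, rng.choice((-1, 1))) for v in rng.sample(range(n), 3)]
           for _ in range(nc)]
table = weighted_3sat_table(n, clauses, eta)
weights = weight_by_degree(fwht(table), n)  # weights[k] is W_k[f]
```

Plotting `weights` on a log scale for several values of `eta` and `nc` gives a small-scale analogue of Figure 2.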

5.2. Applying Fourier Representation in Variable 
Elimination 

We integrate the Fourier representation into the variable elimination algorithm, and evaluate its performance as an approximate probabilistic inference scheme to estimate the partition function of undirected graphical models. We implemented two versions of the Fourier Variable Elimination Algorithm. One version always keeps coefficients with the largest absolute values when we truncate the representation. The other version keeps coefficients with the lowest degree. Our main comparison is against Mini-Bucket Elimination, since the two algorithms are both based on variable elimination, with the only difference being the way in which the messages are approximated. We obtained the source code from the author of Mini-Bucket Elimination, which includes sophisticated heuristics for splitting factors. The version we obtained is used for Maximum A Posteriori Estimation (MAP). We augment this version to compute the partition function by replacing the maximization operators by summation operators. We also compare our VE algorithm with MCMC and Loopy Belief Propagation. We implemented the classical Ogata-Tanemura scheme (Ogata & Tanemura, 1981) with Gibbs transitions in MCMC to estimate the partition function. We use the implementation in LibDAI (Mooij, 2010) for belief propagation, with random updates, damping rate of 0.1 and the maximal number of iterations 1,000,000. Throughout the experiment, we control the number of MCMC steps, the i-bound of Minibucket and the message size of Fourier VE to make sure that the algorithms complete in reasonable time (several minutes).

Category          #ins   Minibucket   Fourier (max coef)   Fourier (min deg)   BP           MCMC   HAK
bn2o-30-*         18     3.91         1.21·10^-?           1.36·10^-?          0.94·10^-?   0.34   8.3·10^-4
grids2/50-*       72     5.12         3.67·10^?            7.81·10^?           1.53·10^-2   -      1.53·10^-2
grids2/75-*       103    18.34        5.41·10^?            6.87·10^-4          2.94·10^-2   -      2.94·10^-2
grids2/90-*       105    26.16        2.23·10^?            5.71·10^-3          5.59·10^-2   -      5.22·10^-2
blockmap_05*      48     1.25·10^?    4.34·10^-?           4.34·10^-?          0.11         -      8.73·10^-9
students_03*      16     2.85·10^?    1.67·10^?            1.67·10^-?          2.20         -      3.17·10^-6
mastermind_03*    48     7.83         0.47                 0.36                27.69        -      4.35·10^-6
mastermind_04*    32     12.30        3.63·10^?            3.63·10^-?          20.59        -      4.03·10^-6
mastermind_05*    16     4.06         2.56·10^?            2.56·10^?           22.47        -      3.02·10^-6
mastermind_06*    16     22.34        3.89·10^?            3.89·10^-?          17.18        -      4.5·10^-6
mastermind_10*    16     275.82       5.63                 2.98                26.32        -      0.14

Table 2. The comparison of various inference algorithms on several categories in the UAI 2010 Inference Challenge. The median differences in log partition function $|\log_{10} Z_{approx} - \log_{10} Z_{true}|$ averaged over benchmarks in each category are shown. Fourier VE algorithms outperform Belief Propagation, MCMC and the Minibucket Algorithm. #ins is the number of instances in each category.

We first compare on small instances for which we can compute ground truth using the state-of-the-art exact inference algorithm ACE (Darwiche & Marquis, 2002). We run on 15-by-15 Ising models with mixed coupling strengths and various field strengths. We run 20 instances for each coupling strength. For a fair comparison, we fix the size of the messages for both Fourier VE and Mini-bucket to $2^{10} = 1{,}024$. Under this message size VE algorithms cannot handle the instances exactly. Figure 3 shows the results. The performance of the two versions of the Fourier VE algorithm is almost the same, so we only show one curve. Clearly the Fourier VE Algorithm outperforms MCMC and Mini-bucket Elimination. It also outperforms Belief Propagation when the field strength is relatively strong.

In addition, we compare our inference algorithms on large 
benchmarks from the UAI 2010 Approximate Inference 
Challenge (UAI). Because we need the ground truth to 
compare with, we only consider benchmarks that can be 
solved by ACE (Darwiche & Marquis, 2002) in 2 hours 
time, and 8GB of memory. The second column of Table 2 
shows the number of instances that ACE completes with 
the exact answer. The 3rd to the 7th column of Table 2 
shows the result for several inference algorithms, includ¬ 
ing the Minibucket algorithm with Abound of 20, two ver¬ 
sions of the Fourier Variable Elimination algorithms, be¬ 
lief propagation and MCMC. To be fair with Minibucket, 
we set the message size for Fourier VE to be 1,048,576 
(2^°). Because the complexity of the multiplication step 
in Fourier VE is quadratic in the number of coefficients, 
we further shrink the message size to 1,024 (2^°) during 
multiplication. We allow 1,000,000 steps for burn in and 
another 1,000,000 steps for sampling in the MCMC ap¬ 
proach. The same with the inference challenge, we com¬ 
pare inference algorithms on the difference in the log parti¬ 
tion function | log .^approx — log -^truel- The table reports 


the median differences, which are averaged over all
benchmarks in each category. If an algorithm fails to
complete on an instance, we count the difference as
$+\infty$, so it is treated as the worst case when
computing the median. For MCMC, $+\infty$ means that the
Ogata-Tanemura scheme did not find a belief state with
substantial probability mass, so the result is far off
after taking the logarithm. The results in Table 2 show
that the Fourier Variable Elimination algorithms
outperform MCMC, BP and Minibucket on many categories in
the Inference challenge. In particular, Fourier VE works
well on grid and structural instances. We also list the
performance of Double-loop Generalized Belief Propagation
(Heskes et al., 2003) in the last column of Table 2. This
implementation won one category in the Inference
challenge, and contains various improvements beyond the
techniques presented in the paper. We used the Inference
challenge's high-precision parameter settings for HAK. As
we can see, Fourier VE matches or outperforms this
implementation in some categories. Unlike the fully
optimized HAK, Fourier VE is a simple variable elimination
algorithm, which passes messages only once. Indeed, the
median time for Fourier VE to complete on bn2o instances
is about 40 seconds, while HAK takes 1800 seconds. We are
investigating how to incorporate the Fourier
representation into message passing algorithms.
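
The error metric described above can be sketched in a few lines. This is a minimal illustration, not the evaluation script used for Table 2: the function names are hypothetical, and representing a failed run as `None` is an assumption made for the example.

```python
import math
from statistics import median

def log_partition_error(log_z_approx, log_z_true):
    """Absolute log-partition error; a failed run (None) counts as +inf."""
    if log_z_approx is None:
        return math.inf
    return abs(log_z_approx - log_z_true)

def median_error(runs):
    """Median error over (log_z_approx, log_z_true) pairs of one category."""
    return median(log_partition_error(a, t) for a, t in runs)

# Hypothetical numbers: the second instance did not complete.
runs = [(10.2, 10.0), (None, 5.0), (3.0, 3.5)]
print(median_error(runs))
```

Treating failures as $+\infty$ means a single failure does not change the median much, but a category where most runs fail reports an infinite median.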

Next we evaluate the algorithms on a synthetically
generated benchmark beyond the capability of exact
inference algorithms. For each instance of this benchmark,
we randomly generate factors of size 3 with low coupling
weights. We then add a backdoor structure to each
instance, by enforcing coupling factors of size 3 in which
the 3 variables of the factor must take the same value.
For these instances, we can compute the expected value of
the partition function and compare it with the output of
the algorithms. We report the results in Figure 4. Here
the experimental setup for each inference algorithm is
kept the same as in the previous experiment. The
Mini-bucket approach is not reported, as it performs very
poorly on these instances. The performance of the two
implementations of Fourier VE is again similar, so they
are combined into one curve. These results show that the
Fourier approach outperforms both MCMC and Belief
Propagation, and suggest that it can perform arbitrarily
better than both approaches as the size of the backdoor
increases.

Figure 4. Log-partition function absolute errors for Weighted
Models with Backdoor Structure. (a) Independent backdoors.
(b) Linked backdoors.

Finally, we compare different inference algorithms on a
machine learning application. Here we learn a grid Ising
model from data. The computation of the partition function
is beyond any exact inference method. Hence, in order to
compare the performance of different inference algorithms,
we have to control the training data that are fed into the
Ising model, so that we can predict what the learned model
looks like. To generate training pictures, we start with a
template with nine boxes (shown in Figure 5a). The
training pictures are of size 25 x 25, so the partition
function cannot be computed exactly by variable
elimination algorithms with message size
$2^{20} = 1,048,576$. Each of the nine boxes in the
template has a 50% chance of appearing in a training
picture, and the occurrences of the nine boxes are
independent of each other. We further blur the training
images with 5% white noise. Figures 5b and 5c show two
examples of the generated training images. We then use
these training images to learn a grid Ising model:

$$\Pr(x) = \frac{1}{Z} \exp\left( \sum_{i \in V} a_i x_i + \sum_{(i,j) \in E} b_{i,j} x_i x_j \right),$$

where V, E are the node and edge sets of a grid,
respectively. We train the model using contrastive
divergence (Hinton, 2002), with k = 15 steps of blocked
Gibbs updates, on 20,000 such training images. (As we will
see, vanilla Gibbs sampling, which updates one pixel at a
time, does not work well on this problem.) We further
encourage a sparse model by using an L1 regularizer. Once
the model is learned, we use inference algorithms to
compute the marginal probability of each pixel. Figures
5d, 5e, 5f and 5g show the marginals computed by Fourier
VE, MCMC, Minibucket Elimination, and Mean Field on the
learned model (white means the probability is close to 1,
black means close to 0). Both Minibucket and Fourier VE
keep a message size of $2^{20} = 1,048,576$, so they
cannot compute the marginals exactly. Fourier VE keeps the
coefficients with the largest absolute values during
multiplication. Pixels outside the nine boxes are black in
most training images. Therefore, their marginals in the
learned model should be close to 0. Pixels within the nine
boxes are white in half of the training images. Hence, the
marginal probabilities of these pixels in the learned
model should be roughly 0.5. We validated these two
empirical observations on small images for which we can
compute the marginals exactly. As we can see, only the
Fourier Variable Elimination algorithm is able to predict
a marginal close to 0.5 on these pixels. The performance
of the MCMC algorithm (a Gibbs sampler, updating one pixel
at a time) is poor. The Minibucket algorithm has noise on
some pixels. The marginals of the nine boxes predicted by
mean field are close to 1, a clearly wrong answer.

Figure 5. Comparison of several inference algorithms on
computing the marginal probabilities of an Ising model learned
from synthetic data. Panels: (a) Template, (b) Train Pic 1,
(c) Train Pic 2, (d) Fourier, (e) MCMC, (f) mbe, (g) Mean Field.
(a) The template to generate training images and (b,c) two
example images in the training set. (d,e,f,g) The marginal
probabilities obtained via four inference algorithms. Only the
Fourier algorithm captures the fact that the 9 boxes are present
half of the time, independently, in the training data.
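
For grids small enough to enumerate, the exact pixel marginals of a grid Ising model can be computed by brute force, which is the kind of check used to validate the expected marginals on small images. The sketch below is illustrative only: the helper names are hypothetical, pixels are assumed to take values in {0, 1} (white = 1), and the parameters `a` and `b` stand for the node and edge weights of the model.

```python
import itertools
import math

def grid_edges(rows, cols):
    """Edges of a rows x cols grid; nodes are indexed in row-major order."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                edges.append((i, i + 1))
            if r + 1 < rows:
                edges.append((i, i + cols))
    return edges

def exact_marginals(a, b, edges):
    """Return ([Pr(x_i = 1)], Z) by enumerating all 2^n assignments."""
    n = len(a)
    z = 0.0
    mass_on = [0.0] * n
    for x in itertools.product((0, 1), repeat=n):
        score = sum(a[i] * x[i] for i in range(n))
        score += sum(b[(i, j)] * x[i] * x[j] for i, j in edges)
        weight = math.exp(score)
        z += weight
        for i in range(n):
            if x[i] == 1:
                mass_on[i] += weight
    return [m / z for m in mass_on], z

# A 3 x 3 grid with all weights zero: every pixel is on half of the time.
edges = grid_edges(3, 3)
a = [0.0] * 9
b = {e: 0.0 for e in edges}
marginals, z = exact_marginals(a, b, edges)
print(marginals[0], z)
```

The enumeration costs $2^n$ evaluations, so it is only feasible for very small grids; the 25 x 25 images in the experiment are far out of reach, which is why approximate inference is needed there.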


6. Conclusion 

We explore a novel way to exploit compact representations
of high-dimensional probability distributions in
approximate probabilistic inference. Our approach is based
on the discrete Fourier representation of weighted Boolean
functions, complementing the classical method of
exploiting conditional independence between the variables.
We show that a large class of weighted probabilistic
graphical models have a compact Fourier representation.
This theoretical result opens up a novel way of
approximating probability distributions. We demonstrate
the significance of this approach by applying it to the
variable elimination algorithm, obtaining very encouraging
results.

Supplementary Materials

Proof of Lemma 1 

Let $R_n^l$ be the collection of restrictions on n Boolean
variables $x_1, \ldots, x_n$ that leave exactly l variables
open. Each restriction in $R_n^l$ leaves a set of l
variables $J = \{x_{i_1}, \ldots, x_{i_l}\}$ open, while it
fixes all other variables $x_i \notin J$ to either -1 or 1.
It is easy to see that the size of $R_n^l$ is given by:

$$|R_n^l| = \binom{n}{l} \cdot 2^{n-l}. \quad (4)$$
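
The count of restrictions can be checked by direct enumeration for small n. The following is a minimal sketch; the function name `restrictions` and the dictionary encoding of a restriction are choices made for this example.

```python
from itertools import combinations, product
from math import comb

def restrictions(n, l):
    """Yield every restriction on n variables that leaves exactly l open.

    A restriction is a dict mapping each fixed variable index to -1 or 1;
    open variables are simply absent from the dict.
    """
    for open_vars in combinations(range(n), l):
        fixed = [i for i in range(n) if i not in open_vars]
        for values in product((-1, 1), repeat=len(fixed)):
            yield dict(zip(fixed, values))

n, l = 5, 2
count = sum(1 for _ in restrictions(n, l))
print(count, comb(n, l) * 2 ** (n - l))
```

Choosing the open set contributes the binomial factor, and each of the remaining n - l variables independently takes one of two values.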


For a restriction $J|z \in R_n^l$, call $J|z$ bad if and only
if for every decision tree h with depth t, there exists at
least one input $x_J$ such that
$|h(x_J) - f|_{J|z}(x_J)| > \gamma$. Let $B_n^l$ be the set
of all bad restrictions, i.e.,
$B_n^l = \{J|z \in R_n^l : J|z \text{ is bad}\}$.
To prove the lemma, it is sufficient to prove that

$$\frac{|B_n^l|}{|R_n^l|} \le \frac{1}{2}\left(\frac{l}{n-l}\, 8uw\right)^t. \quad (5)$$


In the proof that follows, for every bad restriction
$\rho \in B_n^l$, we establish a bijection between $\rho$ and
a pair $(\xi, s)$, in which $\xi$ is a restriction in
$R_n^{l-t}$ and s is a certificate from a witness set A. In
this case, the number of distinct $\rho$'s is bounded by
the number of $(\xi, s)$ pairs:

$$|B_n^l| \le |R_n^{l-t}| \cdot |A|. \quad (6)$$


For a restriction $\rho$, we form the canonical decision tree
for $f|_\rho$ under precision $\gamma$ as follows:

1. We start with a fixed order for the variables and
another fixed order for the factors.

2. If $f|_\rho$ is a constant function, or
$\|f|_\rho\|_\infty \le \gamma$, stop.

3. Otherwise, under restriction $\rho$, some factors evaluate
to fixed values (all variables in these factors are fixed,
or there are free variables but all assignments to these
free variables lead to value 1), while other factors do
not. Examine the factors according to the fixed factor
order until reaching the first factor that still does not
evaluate to a fixed value.

4. Expand the open variables of this factor, under the
fixed variable order specified in step 1. The result will
be a tree (the root branch is for the first open variable,
the branches in the next level are for the second open
variable, etc.).

5. Each leaf of this tree corresponds to $f|_{\rho\pi_1}$, in
which $\pi_1$ is a value restriction for all open variables
of the factor. Recursively apply steps 2 to 5 to the
function $f|_{\rho\pi_1}$, until the condition in step 2
holds. Then attach the resulting tree to this leaf.
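
The construction above can be sketched as a recursive procedure that returns the length of the longest path in the canonical decision tree. This is a toy illustration under stated assumptions, not the construction used in the proof itself: a factor is a hypothetical `(scope, function)` pair, variables take values in {-1, 1}, the fixed variable order is the index order, and a factor counts as "fixed" whenever its value no longer depends on its open variables. The checks enumerate all completions, so the sketch only suits tiny examples.

```python
from itertools import product

def completions(rho, free):
    """Yield every extension of restriction rho over the given free variables."""
    for bits in product((-1, 1), repeat=len(free)):
        x = dict(rho)
        x.update(zip(free, bits))
        yield x

def canonical_depth(factors, rho, gamma, n_vars):
    """Length of the longest path in the canonical decision tree of f|rho."""
    free_all = [v for v in range(n_vars) if v not in rho]
    f_values = set()
    for x in completions(rho, free_all):
        value = 1.0
        for scope, fn in factors:
            value *= fn(*(x[v] for v in scope))
        f_values.add(value)
    # Step 2: stop when f|rho is constant or its max norm is at most gamma.
    if len(f_values) == 1 or max(abs(v) for v in f_values) <= gamma:
        return 0
    # Step 3: first factor (in the fixed factor order) that is not yet fixed.
    for scope, fn in factors:
        open_vars = sorted(v for v in scope if v not in rho)
        outcomes = {fn(*(x[v] for v in scope))
                    for x in completions(rho, open_vars)}
        if len(outcomes) > 1:
            break
    else:
        raise AssertionError("a non-constant product has an unfixed factor")
    # Steps 4-5: expand all open variables of that factor, then recurse.
    deepest = max(canonical_depth(factors, child, gamma, n_vars)
                  for child in completions(rho, open_vars))
    return len(open_vars) + deepest

def penalize_both_on(a, b):
    """Toy factor: 0.5 when both variables are 1, and 1.0 otherwise."""
    return 0.5 if a == 1 and b == 1 else 1.0

factors = [((0, 1), penalize_both_on), ((1, 2), penalize_both_on)]
print(canonical_depth(factors, {}, 0.3, 3))
```

In this toy example the first factor is expanded at the root (two variables), and the second factor is expanded only on branches where it still depends on its remaining open variable.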



Figure 6. A graphical illustration of a canonical decision tree. 


Figure 6 provides a graphical demonstration of a canonical 
decision tree. 

Now suppose restriction $\rho$ is bad. By definition, for any
decision tree h of depth t, there exists at least one input
x such that $|h(x) - f|_\rho(x)| > \gamma$. The canonical
decision tree is no exception. Therefore, there must be a
path in the canonical decision tree of $f|_\rho$ which has
more than t variables. Furthermore, the first t variables
on this path can be split into k ($1 \le k \le t$)
segments, each of which corresponds to one factor. Let
$f_i$ ($i \in \{1, \ldots, k\}$) be these factors, and let
$\pi_1, \ldots, \pi_k$ be the assignments of the free
variables for $f_i$ along this path. Now for each factor
$f_i$, by the definition of the canonical decision tree,
under the restriction $\rho\pi_1 \ldots \pi_{i-1}$,
$f_i|_{\rho\pi_1 \ldots \pi_{i-1}}$ must have a branch whose
value is no greater than $\eta$ (otherwise
$f_i|_{\rho\pi_1 \ldots \pi_{i-1}}$ evaluates to 1
everywhere). We call this branch the "compressing" branch
for factor $f_i|_{\rho\pi_1 \ldots \pi_{i-1}}$. Let the
variable assignment which leads to this compressing branch
for $f_i|_{\rho\pi_1 \ldots \pi_{i-1}}$ be $\sigma_i$. Let
$\sigma = \sigma_1 \ldots \sigma_k$. Then we map the bad
restriction $\rho$ to $\rho\sigma$ and an auxiliary advice
string that we are going to describe.

It is self-explanatory that we can map from any bad
restriction $\rho$ to $\rho\sigma$. The auxiliary advice is
used to establish the backward mapping, i.e., the mapping
from $\rho\sigma$ to $\rho$. When we look at the result of
$f|_{\rho\sigma}$, we will notice that at least one factor
is set to its compressing branch (because we set $f_i$ to
its compressing branch in the forward mapping). Now there
could be other factors set at their compressing branches
(because of $\rho$), but an important observation is that
the number of factors at their compressing branches cannot
exceed $u = \lceil \log_\eta \gamma \rceil + 1$, because
otherwise the other $u - 1$ factors already render
$\|f|_\rho\|_\infty \le \gamma$, and the canonical decision
tree should have stopped expanding this branch. We
therefore can record the index number of $f_i$ out of all
the factors that are fixed at their compressing branches in
the auxiliary advice string, so we can find $f_i$ in the
backward mapping. Notice that this index number will be
between 1 and u, so it takes $\log u$ bits to store it.

Now with the auxiliary information, we can identify which
factor is $f_i$. The next task is to identify which
variables in $f_i$ are fixed by $\rho$, and which are fixed
by $\sigma_i$. Moreover, if a variable is fixed by
$\sigma_i$, we would like to know its correct value in
$\pi_i$. To do this, we introduce additional auxiliary
information: for each factor $f_i$, we use one integer per
free variable of $f_i|_{\rho\pi_1 \ldots \pi_{i-1}}$ to mark
the index of that free variable. Because each $f_i$ is of
width at most w, every integer of this type is between 1
and w (and can therefore be stored in $\log w$ bits). Also,
it requires t integers of this type in total to keep this
information, because we have t free variables in total for
$f_1, \ldots, f_k$.

Notice that it is not sufficient to keep these integers. We
further need $k - 1$ separators, which tell which integer
belongs to which factor $f_i$. Aligning these integers in a
line, we need $k - 1$ separators to break the line into k
segments. These separators can be represented by $t - 1$
bits, in which the i-th bit is 1 if and only if there is a
separator between the i-th and (i+1)-th integer (we have at
most t integers). With these two pieces of information, we
are able to know the locations of the free variables set by
$\sigma_i$ for each factor $f_i$.
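
The separator encoding can be illustrated with a short round trip. This is a sketch with hypothetical helper names; it assumes every factor contributes at least one integer, and it uses 0-based bit positions where the text counts from 1.

```python
def encode_segments(segments):
    """Flatten per-factor index lists and mark the segment boundaries.

    Returns (integers, bits). With t integers there are t - 1 bits, and
    bits[i] is 1 iff a separator sits between integers[i] and integers[i + 1].
    """
    integers = [idx for seg in segments for idx in seg]
    bits = [0] * (len(integers) - 1)
    pos = 0
    for seg in segments[:-1]:
        pos += len(seg)
        bits[pos - 1] = 1
    return integers, bits

def decode_segments(integers, bits):
    """Recover the per-factor index lists from the flat encoding."""
    segments, current = [], [integers[0]]
    for value, sep in zip(integers[1:], bits):
        if sep:
            segments.append(current)
            current = []
        current.append(value)
    segments.append(current)
    return segments

# Three factors (k = 3) with t = 6 free-variable indices in total.
segments = [[1, 3], [2], [1, 2, 3]]
integers, bits = encode_segments(segments)
print(integers, bits)
```

Here the number of ones in `bits` equals k - 1, and decoding the pair `(integers, bits)` returns the original grouping.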

We further need to know the values for each variable in
$\pi_i$. Therefore, we add another t-bit string, in which
each bit is either 0 or 1: 0 means the assignment of the
corresponding variable in $\pi_i$ is the same as the one in
$\sigma_i$, and 1 means the opposite.

With all this auxiliary information, we can start from
$\rho\sigma$, find the first factor $f_1$, further identify
which variables are set by $\sigma_1$ in $f_1$, and set
their values back to those in $\pi_1$. Then, starting from
$f|_{\pi_1}$, we can find $\pi_2$ by the same process, and
continue. Finally, we will find all variables in $\sigma$
and recover the original restriction $\rho$.

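Now to count the length of the auxiliary information: the
total length is $t \log u + t \log w + 2t - 1$ bits.
Therefore, we can have a one-to-one mapping between
elements in $B_n^l$ and $R_n^{l-t} \times A$, in which the
size of A is bounded by
$2^{t \log u + t \log w + 2t - 1} = (uw)^t \cdot 2^{2t-1}$.
In all,

$$\frac{|B_n^l|}{|R_n^l|} \le \frac{|R_n^{l-t}| \cdot |A|}{|R_n^l|} \quad (7)$$

$$\le \frac{\binom{n}{l-t}\, 2^{n-l+t}}{\binom{n}{l}\, 2^{n-l}} \cdot (uw)^t \cdot 2^{2t-1} \quad (8)$$

$$= \frac{l(l-1)\cdots(l-t+1)}{(n-l+1)\cdots(n-l+t)} \cdot \frac{1}{2}\,(8uw)^t \quad (9)$$

$$\le \frac{1}{2}\left(\frac{l}{n-l}\, 8uw\right)^t. \quad (10)$$


Proof of Theorem 3

For each term in the Fourier expansion whose degree is less
than or equal to d, we can treat this term as a weighted
function involving at most d variables. Therefore, it can
be represented by a decision tree in which each path
involves no more than d variables (so the tree has depth at
most d). Because f is represented as the sum over a set of
Fourier terms up to degree d, it can also be represented as
the sum of the corresponding decision trees.

Proof of Theorem 6

Let the Fourier expansion of f be
$f(\mathbf{x}) = \sum_S \hat{f}(S)\, \chi_S(\mathbf{x})$.
Summing out $x_i$, we have:

$$\begin{aligned}
\sum_{x_i \in \{-1, 1\}} f(\mathbf{x})
&= f(\mathbf{x} \setminus x_i, x_i = +1) + f(\mathbf{x} \setminus x_i, x_i = -1) \\
&= \sum_{S: i \in S} \hat{f}(S)\, \chi_{S \setminus i}(\mathbf{x} \setminus x_i) \cdot 1
 + \sum_{S: i \notin S} \hat{f}(S)\, \chi_S(\mathbf{x} \setminus x_i) \\
&\quad + \sum_{S: i \in S} \hat{f}(S)\, \chi_{S \setminus i}(\mathbf{x} \setminus x_i) \cdot (-1)
 + \sum_{S: i \notin S} \hat{f}(S)\, \chi_S(\mathbf{x} \setminus x_i) \\
&= \sum_{S: i \notin S} 2 \cdot \hat{f}(S)\, \chi_S(\mathbf{x} \setminus x_i).
\end{aligned}$$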



